@sophiatev sophiatev commented Jul 14, 2025

This PR enables extended sessions for orchestrations in the .NET isolated framework. It does so with an IMemoryCache that holds the extended session state in memory. A cache entry is evicted if it has not been accessed for the user-specified extendedSessionIdleTimeoutInSeconds, which is communicated to the dotnet SDK via the properties field of the OrchestratorRequest. The properties field is also used to communicate:

  • ExtendedSession: whether or not the orchestration request is within an extended session (in which case the worker will attempt to get the extended session state from the cache, and if it cannot find it, it will add it to the cache),
  • IncludePastEvents: whether or not past orchestration events were included in the orchestration request.
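The eviction rule above (an entry is dropped once it has gone unread for the idle timeout) can be sketched as follows. This is a toy model in Python for brevity, not the SDK's C# code and not the IMemoryCache API; the class name `SlidingCache` and its methods are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SlidingCache:
    """Hypothetical stand-in for a cache with sliding (idle-based) expiration."""

    def __init__(self, idle_timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = idle_timeout_seconds
        self._clock = clock
        # instance id -> (session state, time of last access)
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def set(self, instance_id: str, state: Any) -> None:
        self._entries[instance_id] = (state, self._clock())

    def get(self, instance_id: str) -> Optional[Any]:
        entry = self._entries.get(instance_id)
        if entry is None:
            return None
        state, last_access = entry
        if self._clock() - last_access >= self._timeout:
            # Idle for the full timeout: evict and report a miss.
            del self._entries[instance_id]
            return None
        # A successful read restarts the idle window.
        self._entries[instance_id] = (state, self._clock())
        return state

    def remove(self, instance_id: str) -> None:
        # Used when an orchestration completes and its state is no longer needed.
        self._entries.pop(instance_id, None)
```

The injectable `clock` is only there so the idle window can be exercised without real waiting.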

The latter is necessary to detect the edge case where a worker has already evicted an extended session from its cache, but the host assumes the worker still has the state in memory and does not include an orchestration history with the request (this should be very rare but can happen if e.g. there is a network delay when sending the orchestration request to the worker, or something along those lines). If there is no orchestration history (this is the first execution of the orchestration), then everything is fine - the worker will play the new events and create a cache entry for the extended session (IncludePastEvents will be true, but there is simply nothing to include). If there is an orchestration history but it was not included with the request (IncludePastEvents is false), then the worker will not attempt to execute the request, since it lacks the history it needs to replay the orchestration up to the current execution. Instead it will set the new requiresHistory field in the OrchestratorResponse to true, in which case the host will end the extended session via a SessionAbortedException, so that the next time the orchestration request is sent a history will be included.
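The worker-side decision described above can be sketched roughly as follows (Python for brevity, not the SDK's C# code). The `Response` type, the `handle_request` function, the plain-dict cache, and the defaults used when a property is missing are all assumptions made for illustration; completion handling is omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Response:
    # Mirrors the idea of the requiresHistory field; the shape is hypothetical.
    requires_history: bool = False
    actions: Optional[List[str]] = None


def handle_request(cache: Dict[str, Any], instance_id: str,
                   properties: Dict[str, Any],
                   past_events: List[str], new_events: List[str]) -> Response:
    # Assumed defaults: no extended session, history included.
    extended = bool(properties.get("ExtendedSession", False))
    include_past = bool(properties.get("IncludePastEvents", True))

    state = cache.get(instance_id) if extended else None
    if state is None:
        if not include_past:
            # Cache miss and the host withheld the history: ask for it.
            return Response(requires_history=True)
        # First execution or a fresh session: "replay" whatever history came along.
        state = list(past_events)

    state = state + list(new_events)
    if extended:
        cache[instance_id] = state  # keep the state for the next request
    return Response(actions=[f"processed:{e}" for e in new_events])
```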

Other PRs:

Open questions

Currently we extract the properties from the OrchestratorRequest.Properties field using string literals. This is not ideal - we would rather define constants somewhere for these property keys. The problem is that WebJobs creates the Properties field from the field names of the RemoteOrchestratorConfiguration class, and we do not want to import this class into the SDK and use it to figure out what these string literals should be.

Any ideas how to remedy this? Should we construct the OrchestratorRequest.Properties field another way?
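One possible shape for the constants idea, sketched in Python for brevity rather than in the SDK's C#: keep the key names in one place and parse the properties once into a typed value with defaults. The key strings below, the case-insensitive lookup, the defaults, and the names `SessionOptions` and `parse_options` are all assumptions; the real keys come from RemoteOrchestratorConfiguration and may differ.

```python
from typing import Any, Mapping, NamedTuple

# Hypothetical key names, defined once instead of repeated as literals.
EXTENDED_SESSION_KEY = "ExtendedSession"
INCLUDE_PAST_EVENTS_KEY = "IncludePastEvents"
IDLE_TIMEOUT_KEY = "ExtendedSessionIdleTimeoutInSeconds"


class SessionOptions(NamedTuple):
    extended_session: bool
    include_past_events: bool
    idle_timeout_seconds: float


def parse_options(properties: Mapping[str, Any]) -> SessionOptions:
    # Case-insensitive lookup so a casing difference between host and SDK
    # does not silently disable the feature (a design choice, not a requirement).
    lowered = {k.lower(): v for k, v in properties.items()}
    timeout = float(lowered.get(IDLE_TIMEOUT_KEY.lower(), 0) or 0)
    # Treat the session as extended only if a positive idle timeout is given.
    extended = bool(lowered.get(EXTENDED_SESSION_KEY.lower(), False)) and timeout > 0
    # Assume history is included unless the host says otherwise.
    include_past = bool(lowered.get(INCLUDE_PAST_EVENTS_KEY.lower(), True))
    return SessionOptions(extended, include_past, timeout)
```

This assumes the property values arrive already typed (booleans and numbers); string values such as "false" would need explicit parsing.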

Design Callout

There are several avenues by which the worker can inform the host that it needs an orchestration history in the case that the worker has ended the extended session before the host. The worker will evict an extended session after the user-specified extendedSessionIdleTimeoutInSeconds expires. More specifically, what this means is that if the extended session is not accessed within that timeframe, the worker will evict it from the cache. In the meantime, it has to send the result of the orchestration work item back to the host, the host has to process this and commit it to storage, then wait for new orchestration messages, then send the worker the next work item once new messages arrive. From the host's perspective, the extendedSessionIdleTimeoutInSeconds applies only to the amount of time we wait for new orchestration messages to arrive before ending the extended session. All that to say, there could perhaps be an appreciable number of situations where the worker ends the extended session before the host and will require a history.
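To make the timing mismatch concrete, here is a small model in Python (not SDK code). It assumes the worker's idle window starts when it finishes a work item, while the host counts only the wait for new messages against the timeout; the function name and the four duration parameters are hypothetical.

```python
def worker_evicts_first(idle_timeout: float, result_transit: float,
                        commit: float, wait_for_messages: float,
                        request_transit: float) -> bool:
    """Return True if the worker has evicted the session by the time the
    host's next request (sent without history) arrives."""
    # From the worker's side, every step between two work items counts as idle time.
    worker_idle = result_transit + commit + wait_for_messages + request_transit
    # From the host's side, only the wait for new messages counts.
    host_keeps_session = wait_for_messages < idle_timeout
    return host_keeps_session and worker_idle >= idle_timeout
```

For example, with a 30 second timeout, 0.5 seconds of transit each way, a 2 second commit and a 28 second wait for messages, the host still considers the session live (28 < 30) while the worker has been idle for 31 seconds and has already evicted it.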

The current approach is perhaps the simplest - have the worker inform the host via the requiresHistory flag that it needs a history, in which case the host throws a SessionAbortedException from OutOfProcMiddleware. DT.Core already has all the logic to handle a SessionAbortedException and retry the work item. The cons of this approach are that the SessionAbortedException surfaces to the logs and may alarm customers. There is also an added delay that comes from having DT.Core abort the orchestration work item entirely and try it again later in this approach. All that being said, perhaps the number of situations where this occurs could be reduced by just advising customers to up their extendedSessionIdleTimeoutInSeconds count in the isolated model.
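The host-side flow of this approach might look roughly like the sketch below (Python for brevity, not the actual OutOfProcMiddleware or DT.Core code). The exception class is a local stand-in, `host_dispatch` and `run_with_retry` are hypothetical names, and the immediate retry is a simplification of DT.Core retrying the work item later.

```python
from typing import Any, Callable, Dict, List


class SessionAbortedException(Exception):
    """Local stand-in for the DT.Core exception of the same name."""


def host_dispatch(send_to_worker: Callable[..., Dict[str, Any]],
                  history: List[str], new_events: List[str],
                  in_extended_session: bool) -> Dict[str, Any]:
    # Inside an extended session the host assumes the worker has the state
    # and leaves the history out of the request.
    include_past = not in_extended_session
    response = send_to_worker(
        past_events=history if include_past else [],
        new_events=new_events,
        properties={"ExtendedSession": True, "IncludePastEvents": include_past},
    )
    if response.get("requiresHistory"):
        raise SessionAbortedException("worker no longer holds the session state")
    return response


def run_with_retry(send_to_worker: Callable[..., Dict[str, Any]],
                   history: List[str], new_events: List[str]) -> Dict[str, Any]:
    try:
        return host_dispatch(send_to_worker, history, new_events, in_extended_session=True)
    except SessionAbortedException:
        # The session is over; the retry starts fresh with the full history.
        return host_dispatch(send_to_worker, history, new_events, in_extended_session=False)
```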

An alternative is to have the worker call StreamInstanceHistory when it has already ended the extended session and needs a history. This will not surface an exception to the customer, and will lead to less of a delay in processing the work item. But it could be much more complicated to implement, as I am not sure how to get a client over to the GrpcOrchestrationRunner to accomplish this, and it may require additional edge-case handling for all the network issues that could arise.
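The control flow of this alternative, reduced to a sketch in Python (not SDK code): on a cache miss with no history attached, the worker pulls the history itself instead of returning requiresHistory. The `fetch_history` callable is a hypothetical stand-in for a StreamInstanceHistory call; how the worker would obtain such a client, and how network failures are handled, is exactly the open part and is not modeled here.

```python
from typing import Any, Callable, Dict, Iterable, List


def resolve_state(cache: Dict[str, Any], instance_id: str,
                  include_past_events: bool, past_events: List[str],
                  fetch_history: Callable[[str], Iterable[str]]) -> List[str]:
    state = cache.get(instance_id)
    if state is not None:
        return state  # extended session still alive on the worker
    if include_past_events:
        return list(past_events)  # host sent the history with the request
    # Cache miss and no history attached: fetch it rather than abort the session.
    return list(fetch_history(instance_id))
```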

Manual testing done thus far

  • If extended sessions are not enabled, everything still proceeds as expected (the orchestration history is replayed upon every execution)
  • If an extended session expires, a history is sent along with the orchestration request to the worker. The worker replays the orchestration and then saves the state in a new extended session (state is only saved if the orchestration is not completed)
  • Upon completion of an orchestration, the extended session state is evicted from the cache
  • If the worker ends the extended session before the host does (this should be very rare but can happen if e.g. there is a network delay when sending the orchestration request to the worker, or something along those lines), then the host throws a SessionAbortedException and the work item is retried.
  • Other OOProc scenarios still work as expected (they do not have extended sessions enabled so the orchestration is replayed upon every request. If they attempt to enable extended sessions an error occurs upon startup indicating that the extended session feature does not work for these other languages).

Performance testing

Two scenarios were run in in-process with extended sessions enabled/disabled, and in isolated with extended sessions enabled/disabled. Multiple trials were run, and the time it took to complete the orchestration was recorded as well as the number of times the history was loaded in each of these settings (with extendedSessionIdleTimeoutInSeconds set to 30 seconds). The two scenarios tested were

  • Fan out to a large number of tasks (30,000) and then fan in. For both isolated and in-process, this took about 35 minutes without extended sessions and around 25 minutes with them. In the isolated model there were consistently two history loads with ES enabled, with the second corresponding to a SessionAbortedException being thrown (the worker ends the extended session before the host and, as expected, the exception is thrown, which leads to a retry of the work item, this time with a history attached). In in-process, there was just one history load. Without ES, there were about 100 history loads in each.
  • Run a large number of sequential tasks (1000) with large payloads (strings of 10,000 characters). For both isolated and in-process, this took about 30 minutes without extended sessions and around 5 minutes with them. Without ES there were around 100 history loads, and with ES just 1. Interestingly, this scenario ran somewhat faster in isolated than in in-process, both with and without ES.

@andystaples andystaples left a comment

Left some comments

@cgillum cgillum left a comment

A few things I think we should address in this PR before merging.

Sophia Tevosyan added 3 commits September 15, 2025 18:39
…perties are not specified, also added another test to make sure that extended sessions aren't stored if isExtendedSessions is false
@sophiatev sophiatev merged commit dd84948 into main Sep 17, 2025
4 checks passed
@sophiatev sophiatev deleted the stevosyan/extended-sessions-for-orchestrations-isolated branch September 17, 2025 20:47